TA-DRD: A Three-step Automatic Duplicate Record Detection

نویسندگان

  • Yongquan Dong
  • Ping Ling
  • Yali Liu
  • Qiang Chu
چکیده

Duplicate record detection is a key step in Deep Web data integration, but the existing approaches do not adapt to its large-scale nature. In this paper, a three-step automatic approach is proposed for duplicate record detection in Deep Web. It firstly uses cluster ensemble to select initial training instance. Then it utilizes tri-training classification to construct classification model. Finally, it uses evidence theory to combine the results of multiple classification models to construct the domain-level duplicate record detection model which can be used for large-scale duplicate record detection in the same domain. Experimental results show that the proposed approach is better than previous work and and the domainlevel duplicate record detection model can get high performance.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Accuracy and validity of data are prerequisites of appropriate operations of any software system. Always there is possibility of occurring errors in data due to human and system faults. One of these errors is existence of duplicate records in data sources. Duplicate records refer to the same real world entity. There must be one of them in a data source, but for some reasons like aggregation of ...

متن کامل

Duplicate detection algorithms of bibliographic descriptions

Purpose – The purpose of this paper is to focus on duplicate record detection algorithms used for detection in bibliographic databases. Design/methodology/approach – Individual algorithms, their application process for duplicate detection and their results are described based on available literature (published articles), information found at various library web sites and follow-up e-mail commun...

متن کامل

Automatic Detection of Microaneurysms in Color Fundus Images using a Local Radon Transform Method

Introduction: Diabetic retinopathy (DR) is one of the most serious and most frequent eye diseases in the world and the most common cause of blindness in adults between 20 and 60 years of age. Following 15 years of diabetes, about 2% of the diabetic patients are blind and 10% suffer from vision impairment due to DR complications. This paper addresses the automatic detection of microaneurysms (MA...

متن کامل

Record Matching Over Query Results Using Fuzzy Ontological Document Clustering

Record matching is an essential step in duplicate detection as it identifies records representing same real-world entity. Supervised record matching methods require users to provide training data and therefore cannot be applied for web databases where query results are generated on-the-fly. To overcome the problem, a new record matching method named Unsupervised Duplicate Elimination (UDE) is p...

متن کامل

A Comparative Study in Classification Techniques for Unsupervised Record Linkage Model

Problem statement: Record linkage is a technique which is used to detect and match duplicate records which are generated in data integration process. A variety of record linkage algorithms with different steps have been developed in order to detect such duplicate records. To find out whether two records are duplicate or not, supervised and unsupervised classification techniques are utilized in ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015